In this project I’m going to investigate white wine quality. The final result will be predictive model of wine quality based on chemical properties. In the first section presented data exploration. In the second part building predictive model.
df <- read.csv('wineQualityWhites.csv')
dim(df)
## [1] 4898 13
str(df)
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
summary(df)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
More precise look at quality column.
table(df$quality)
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
So it’s more useful and suitable to create ordered factor.
df$quality.factor <- factor(df$quality, ordered=TRUE)
df$X <- NULL
library(ggplot2)
library(GGally)
library(gridExtra)
## Loading required package: grid
ggplot(data=df, aes(x=fixed.acidity)) + geom_histogram(aes(fill=..count..), binwidth = 0.2)
Looks very normal, let’s add boxplot.
ggplot(data=df, aes(y=fixed.acidity, x = quality)) + geom_boxplot(aes(color=quality.factor))
There is no some significant difference between quality and fixed acidity. Remind that fixed acidity: most acids involved with wine or fixed or nonvolatile (do not evaporate readily).
Move to next variable. It’s volatile acidity : the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste.
ggplot(data=df, aes(x=volatile.acidity)) + geom_histogram(aes(fill=..count..), binwidth=0.02)
Look at relationship with quality.
ggplot(data=df, aes(y=volatile.acidity, x = quality)) + geom_boxplot(aes(fill=quality.factor))
There is no logical separation. So i’m going to combine acidity variables with quality.
ggplot(data=df, aes(y=fixed.acidity, x = volatile.acidity)) + geom_point(aes(color=quality.factor)) + scale_colour_brewer(type="seq", palette=3)
Unfortunately, no visual understandable separation using this two features. Go forward to next.
g1 <- ggplot(data=df, aes(x=citric.acid)) + geom_histogram(aes(fill=..count..), binwidth=0.03)
g2 <- ggplot(data=df, aes(y=citric.acid, x = quality)) + geom_boxplot(aes(fill=quality.factor))
grid.arrange(g1,g2, ncol=1)
Look at residual.sugar variable.
g1 <- ggplot(data=df, aes(x=residual.sugar)) + geom_histogram(aes(fill=..count..), binwidth=0.5)
g2 <- ggplot(data=df, aes(y=residual.sugar, x = quality)) + geom_boxplot(aes(color=quality.factor))
grid.arrange(g1,g2, ncol=1)
Combining residual.sugar and citric.acid variables to determine some linear separation.
ggplot(data=df, aes(y=residual.sugar, x = citric.acid)) + geom_point(aes(color=quality.factor)) + scale_colour_brewer(type="seq", palette=3)
Move to next variable - chlorides.
g1 <- ggplot(data=df, aes(x=chlorides)) + geom_histogram(aes(fill=..count..), binwidth=0.005)
g2 <- ggplot(data=df, aes(y=chlorides, x = quality)) + geom_boxplot(aes(color=quality.factor))
grid.arrange(g1,g2, ncol=1)
## Warning: position_stack requires constant width: output may be incorrect
More meaningful variable. Easily can see to many outliers in wine with quality 5 and 6.
Exploring together free silfur and total sulfur dioxides.
g1 <- ggplot(data=df, aes(x=free.sulfur.dioxide)) + geom_histogram(aes(fill=..count..), binwidth=5)
g2 <- ggplot(data=df, aes(y=free.sulfur.dioxide, x = quality)) + geom_boxplot(aes(color=quality.factor))
g3 <- ggplot(data=df, aes(x=total.sulfur.dioxide)) + geom_histogram(aes(fill=..count..), binwidth=5)
g4 <- ggplot(data=df, aes(y=total.sulfur.dioxide, x = quality)) + geom_boxplot(aes(color=quality.factor))
grid.arrange(g1,g2,g3,g4, ncol=2)
ggplot(data=df, aes(y=free.sulfur.dioxide, x = total.sulfur.dioxide)) + geom_jitter(alpha=1/5,aes(color=quality.factor)) + scale_colour_brewer(type="seq", palette=3)
Unfortunately no meaningful separation yet. Going to next variables density and Ph. Density is too simillar for all kinds of wines, so i decided to use exp square root transfort.
g1 <- ggplot(data=df, aes(x=exp(density))) + geom_histogram(aes(fill=..count..), binwidth=0.001)
g2 <- ggplot(data=df, aes(y=exp(density), x = quality)) + geom_boxplot(aes(color=quality.factor))
g3 <- ggplot(data=df, aes(x=pH)) + geom_histogram(aes(fill=..count..), binwidth=0.02)
g4 <- ggplot(data=df, aes(y=pH, x = quality)) + geom_boxplot(aes(color=quality.factor))
grid.arrange(g1,g2,g3,g4, ncol=2)
## Warning: position_stack requires constant width: output may be incorrect
## Warning: position_stack requires constant width: output may be incorrect
We can see some interesting trends from this plots, like with less density -> quality higher. The same is for pH, but in median thinking. Combine this two features to look at this data.
ggplot(data=df, aes(y=exp(exp(density)), x = exp(pH))) + geom_jitter(aes(color=quality.factor)) + scale_colour_brewer(type="seq", palette=3) + geom_abline(intercept = 14.85, slope = 0)
Look like under black line there is more chance that wine quality is high. In someway first result.
Okey. And the last two variables alcohol and sulphates.
g1 <- ggplot(data=df, aes(x=alcohol)) + geom_histogram(aes(fill=..count..), binwidth=0.1)
g2 <- ggplot(data=df, aes(y=alcohol, x = quality)) + geom_boxplot(aes(color=quality.factor))
g3 <- ggplot(data=df, aes(x=sulphates)) + geom_histogram(aes(fill=..count..), binwidth=0.02)
g4 <- ggplot(data=df, aes(y=sulphates, x = quality)) + geom_boxplot(aes(color=quality.factor))
grid.arrange(g1,g2,g3,g4, ncol=2)
## Warning: position_stack requires constant width: output may be incorrect
In median thinking more alcohol in wine implies higher quality, based on this plots, unfortunately sulphates is now so separable for different wine qualities. Combining isn’t suitable due to low variability of sulphates variables.
Let’s look at correlations between all variables and numeric analogue of quality.
library(corrgram)
corrgram(df, type="data", lower.panel=panel.conf,
upper.panel=panel.shade, main= "Corrgram for wine quality dataset", order=T, cex.labels=1.4)
Notes:
From corrgram we can conclude next important variables for quality prediction (decision is made using confidence intervals):
ggplot(data=subset(df, density < 1.005), aes(x=alcohol, y = density, color = quality.factor)) + xlab("Alcohol") +
ylab("Density") + ggtitle("Alcohol and density by quality") +
stat_binhex()
We easily can see some patterns here. This patterns is small clusters where quality is the same. This plot is awesome, it shows quality, density, alcohol relationship. With low alcohol or high density it’s more usual to be low quality wine.
Next plot is about pH and chlorides. They both has high absolute correlation among others variables.
ggplot(data=subset(df, chlorides < 0.2), aes(x=pH, y = chlorides, color = quality.factor)) + xlab("pH") +
ylab("Chlorides") + ggtitle("pH and Chlorides by quality") +
stat_binhex()
We got more patterns. High chlorides means low quality. Based on this two plots we can easily predict whether wine is low or high quality, but this is not our case, so we move to prediction.
ggplot(data=subset(df,density < 1.005) , aes(x=alcohol, y=density, color=quality.factor)) + geom_point() + facet_wrap(~quality.factor) + geom_smooth(colour='black')
We can see some bound trends between this variables across different wine qualities.
ggplot(data=subset(df, chlorides < 0.2), aes(x=pH, y = chlorides, color = quality.factor)) + geom_point() + facet_wrap(~quality.factor) + geom_smooth(colour='black')
Unfortunately here we can observate some stability between this variables, trend is the same, no unusual things.
Let’s explore the highest correlation variable by quality.
ggplot(data=subset(df, density < 1.005), aes(x=density, y = residual.sugar, color = quality.factor)) + geom_point() + facet_wrap(~quality.factor) + geom_smooth(colour='black')
I want to use simple tree model as my main model.
library(rpart)
library(rattle)
fit <- rpart(quality.factor ~ alcohol + density + pH + chlorides, data=df)
printcp(fit)
##
## Classification tree:
## rpart(formula = quality.factor ~ alcohol + density + pH + chlorides,
## data = df)
##
## Variables actually used in tree construction:
## [1] alcohol
##
## Root node error: 2700/4898 = 0.55125
##
## n= 4898
##
## CP nsplit rel error xerror xstd
## 1 0.040185 0 1.00000 1.00000 0.012892
## 2 0.020741 2 0.91963 0.92296 0.012958
## 3 0.010000 3 0.89889 0.90222 0.012960
fancyRpartPlot(fit)
Not good not bad, but acceptable as for initial solution.
ggplot(data=df, aes(y=alcohol, x = quality)) + geom_boxplot(aes(color=quality.factor)) +
xlab("Quality") + ylab("Alcohol") + ggtitle("Alcohol by quality")
The median of alcohol variable by quality is likely to be higher for higher quality white wine. This follows that one of the main features of high quality wine is highthe percent alcohol content of this wine.
ggplot(data=df, aes(y=exp(exp(density)), x = exp(pH))) + geom_jitter(aes(color=quality.factor)) + scale_colour_brewer(type="seq", palette=3) + geom_abline(intercept = 14.85, slope = 0) + ylab("Transformed density") + xlab("Transformed pH") + ggtitle("Ph and density by quality Separation")
There is exists some soft separation line between higher and lower quality wines. One of this lines can be straight line of transformed density equal 14.85.
ggplot(data=subset(df, chlorides < 0.2), aes(x=pH, y = chlorides, color = quality.factor)) + xlab("pH") +
ylab("Chlorides") + ggtitle("pH and Chlorides by quality") +
stat_binhex()
From this plot we can clearly see that data have some clusters, where white wine quality is the same. It seems to be one of the solutions for classifying wines with this known boundaries. Higher chlorides variables means lower wine quality. pH has some interval for good wines.
The white wine quality data set contains information on almost 4898 wines, their chemical properties and wine quality from best experts (i believe). I’ve asked a question how we can predict wine quality using only information about chemical properties of this wine. Quality measures from 0 (worst) to 10 (best). I started by understanding the individual variables in the data set and their influence on wine quality. I’ve transformed quality from numeric to ordered factor. During exploration I’ve found some linear patterns how to separate low and high quality wines. The highest influence on wine quality is alcohol content in wine, it’s has the highest correlation. During correlation analysis I’ve found four important variables for this task are Alcohol, pH, density and chlorides. This variables I’ve included in simple tree model for predicting wine quality. I’ve obtained 0.55 root node error. It’s quite higher. From multivariate plots i can conclude, that there are non linear patterns in this data set. More better model for prediction is SVM, it’s gives high accuracy as described in this article http://www3.dsi.uminho.pt/pcortez/white.pdf.